Note: This page’s design, presentation and content have been created and enhanced using Claude (Anthropic’s AI assistant) to improve visual quality and educational experience.
Week 8 • Sub-Lesson 4

🎤 Transcription and Audio Analysis

AI transcription for research interviews, focus groups, and fieldwork — including what it gets wrong, and what it means for South African research contexts

Overview

Qualitative researchers, oral historians, social scientists, and educators all work with audio — interviews, focus groups, fieldwork recordings, classroom observations. AI transcription has genuinely transformed this work: tasks that once took hours of manual transcription can now be completed in minutes. But transcription AI also hallucinates — inventing words, sentences, and sometimes entire passages that were never spoken. For research that depends on accurate verbatim records, this is not a minor inconvenience.

This sub-lesson covers the real capabilities, the real failure modes, and — particularly important for this course — what these tools do and do not do with South African and other African languages. We then look at AI-assisted qualitative analysis tools and close with critical guidance on data privacy for audio recordings.

🎤 The Transcription Landscape

The current transcription tool landscape falls into three broad groups: open-source models (primarily the Whisper family) that run locally, commercial cloud APIs that offer enhanced accuracy and features, and specialist providers focused on African language support.

Tool | Type | Accuracy / Coverage | Key Strength | Cost
Whisper large-v3 (OpenAI) | Open-source | ~2.0% WER (clean audio) | Free, local, widely supported | Free
OpenAI Transcription API (GPT family) | Cloud API | Better than Whisper on standard benchmarks | Lowest word error rate; API access | Paid
WhisperX | Open-source | Comparable to Whisper large-v3 | Enhanced speaker diarisation; better multi-speaker | Free
Deepgram | Cloud API | Strong | Fast; good for meeting / call audio; speaker labels | Paid
AssemblyAI | Cloud API | Strong | Meeting transcription; speaker identification | Paid
Intron Sahara v2 | Cloud API | 57 languages (24 added in v2) | 500+ African English accent variants | Paid
Lelapa AI | Cloud API | South African languages | Specialist: Zulu, Xhosa, Sotho, Afrikaans | Paid
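
Cloud APIs in this table are typically called through a small client library. The sketch below is a minimal, hedged example using OpenAI's Python client and its hosted whisper-1 transcription endpoint; the filename is a placeholder, and newer GPT-family transcription models use different model names, so check the provider's current documentation. Before sending participant audio to any cloud service, apply the data governance checks described in the privacy section later in this sub-lesson.

```python
# pip install openai  -- minimal sketch of a cloud transcription call.
# "pilot_interview.mp3" is a placeholder filename; participant audio should only
# be uploaded after the governance checks described in the privacy section below.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("pilot_interview.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # hosted Whisper; GPT-family transcription models use other names
        file=audio_file,
    )

print(transcript.text)
```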

📈 Accuracy — What the Numbers Mean

🔎 Understanding Word Error Rate (WER)

WER (Word Error Rate) is the standard metric for transcription accuracy. A 10% WER means 1 in every 10 words is incorrect. For a 1-hour interview of approximately 8,000 words, a 10% WER represents around 800 errors. For academic research, a 10% error rate means every transcript requires careful human review before coding or analysis. The metric combines substitutions (wrong word), deletions (dropped word), and insertions (invented word) into a single number.
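
As an illustration of the arithmetic, WER = (substitutions + deletions + insertions) / number of words in the reference transcript. The sketch below computes it with the open-source jiwer library; the sentence pair is invented for illustration.

```python
# pip install jiwer  -- a small worked example of Word Error Rate.
from jiwer import wer

# Reference = what was actually said; hypothesis = what the transcriber produced.
reference  = "the committee approved the budget for the new rural clinic"
hypothesis = "the committee approved a budget for the new rural clinic"

# WER = (substitutions + deletions + insertions) / words in reference
error_rate = wer(reference, hypothesis)
print(f"WER: {error_rate:.2%}")   # one substitution over ten words -> 10.00%
```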

Whisper large-v3 performance across audio conditions (recent benchmarks):

Audio Condition | Word Error Rate | What This Means
Clean read speech (LibriSpeech test-clean) | ~2.0% | Impressively accurate; ~1 error per 50 words
Real-world mixed audio | 7.88–11.46% | Requires review; ~1 error per 9–13 words
Meeting audio | ~11.5% | Requires careful review; ~1 error per 9 words
Telephony / call centre quality | 17.7% | Substantial errors; manual re-transcription often faster
Spontaneous speech, challenging conditions | ~17–26% | 74–83% word accuracy; significant manual effort required

Note that the “clean read speech” benchmark (LibriSpeech) represents near-ideal conditions: studio recordings of people reading prepared text aloud. Most research audio is far closer to the “real-world mixed audio” or “meeting audio” conditions. A figure of approximately 2.7% is frequently quoted in secondary sources (typically referencing the earlier large-v2 model); the Whisper large-v3 model card reports approximately 2.0% on clean audio. Neither figure is representative of typical research use.

⚠️ Transcription Hallucination — The Critical Warning

⚠️ Transcription Hallucination Is a Real and Documented Problem

Investigative reporting in 2024 (Associated Press, TechCrunch) surfaced cases where Whisper hallucinated in approximately 80% of public meeting transcriptions tested by researchers. A separate peer-reviewed study — Koenecke et al. (FAccT 2024, “Careless Whisper”) — analysed Whisper transcriptions of recorded speech samples, including speakers with aphasia, and found that 38% of the hallucinations identified included explicit harms such as perpetuating violence, inventing associations, or implying false authority. Together these findings confirm that transcription hallucination is not an occasional glitch: it is a systematic feature of how language model-based transcription works.

What transcription hallucination looks like:

❌ Invented Sentences

Complete sentences that were never spoken appear in the transcript — sometimes plausible-sounding extensions of what was said, sometimes entirely unrelated to the audio content. The model fills silence or uncertain audio with generated text.

🔃 Wrong Words

Homophones and phonetically similar sequences produce substitution errors. “Their” becomes “there”, technical terms are replaced with common words that sound similar. These errors are often plausible in context, making them harder to detect on first reading.

👁️ Fabricated Proper Nouns

Names of people, places, and organisations that the model does not recognise from its training data are sometimes replaced with invented approximations — a real person’s name becomes a different name that sounds similar. Technical vocabulary and field-specific jargon are particularly vulnerable.

🚫 Dropped Content

Entire sentences are omitted silently with no indication in the transcript that anything was missed. Deletion errors are often harder to detect than substitution errors because there is no visible evidence of the omission unless you are following along with the audio.

📌 When Hallucination Is Most Likely

Low-quality audio, background noise, or cross-talk between speakers increases hallucination rates substantially. Unusual terminology (technical vocabulary, field-specific jargon, specialist terms not represented in training data), accented speech that differs significantly from training data, quiet or mumbled speech, and multiple overlapping speakers all increase risk. Research interviews — especially fieldwork recordings in non-studio conditions — routinely combine several of these risk factors simultaneously.

📍 The Verification Imperative

Every transcription used for academic analysis must be reviewed by a human with access to the original audio. Spot-checking 10% of the transcript is not sufficient for verbatim accuracy. For critical passages — any passage you intend to quote directly, any passage that forms the evidential basis of an analytical claim — full verification against the audio is required. This is not optional for publishable research.
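
One practical aid, which supplements rather than replaces full verification of quoted passages, is to let the model flag its own least-confident segments so that human review time is spent where errors are most likely. The sketch below uses the locally run open-source Whisper package; the filename and thresholds are illustrative assumptions, loosely based on the package's default cut-offs.

```python
# pip install openai-whisper  -- runs locally; audio never leaves your machine.
# A triage sketch: surface low-confidence segments for priority human review.
# The filename and thresholds below are illustrative, not validated cut-offs.
import whisper

model = whisper.load_model("large-v3")
result = model.transcribe("interview_07.wav")

for seg in result["segments"]:
    suspicious = (
        seg["avg_logprob"] < -1.0          # low average token log-probability
        or seg["no_speech_prob"] > 0.5     # model suspects there was no speech here
        or seg["compression_ratio"] > 2.4  # highly repetitive output, a hallucination signal
    )
    if suspicious:
        print(f"REVIEW {seg['start']:7.1f}-{seg['end']:7.1f}s  {seg['text'].strip()}")
```

Flagged segments are a starting point for review; unflagged segments can still contain errors, which is why quoted passages always require checking against the audio.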

🇿🇦 South African and African Language Transcription

🌎 The South African Research Context

Standard transcription tools are trained predominantly on English and major European languages. Their performance on South African languages — and on South African English accents — is significantly lower than the headline accuracy figures suggest. This section gives you the honest performance picture, the tools designed to address this gap, and practical guidance for researchers working in multilingual South African contexts.

Representative ASR performance on South African languages, W2v-BERT model with 1 hour vs 50 hours of fine-tuning (Nahabwe et al., 2025, arXiv:2512.10968):

Language | WER (W2v-BERT, 1 hour fine-tuning) | WER (W2v-BERT, 50 hours fine-tuning) | Improvement
Afrikaans | 22.7% | 3.23% | ≈86% reduction
Zulu | 28.02% | 7.89% | ≈72% reduction
Xhosa | 27.83% | 7.13% | ≈74% reduction

These figures are from Nahabwe et al. (2025, Deep Learning Indaba 2025), which benchmarks several architectures (Whisper, W2v-BERT, XLS-R, MMS) on African languages. Other models in the same study reach broadly comparable or better WER at 50 hours — e.g., Whisper Afrikaans at 50h is approximately 2.11%. The dramatic improvement from fine-tuning on 1–50 hours of target language data applies across architectures; the practical barrier is the requirement for labelled audio data in the target language, which remains scarce for many African languages. With only 1 hour of fine-tuning, a WER of ~28% on Zulu means roughly 1 in every 3.5 words is wrong — a level of error that makes automated analysis unreliable.

🌎 Dedicated African Language Tools

Several tools have been developed specifically to address the performance gap for African languages:

  • Intron Sahara v2 (2026): Covers 57 languages total (24 newly added in v2, including Zulu, Xhosa, and Afrikaans), with support for over 500 African English accent variants. Developed by a South African AI company. Commercial API. https://intron.ai
  • Lelapa AI: South African language specialist. Focuses on Zulu, Xhosa, Sotho, and Afrikaans. Developed with African language experts and linguists. https://lelapa.ai
  • Microsoft PazaBench: The first independent leaderboard for low-resource African language ASR (Automatic Speech Recognition), covering 39 languages and 52 models. Use this to compare current model performance across languages before committing to a tool.
  • AfriSpeech-Dialog (NAACL 2025): New benchmark for spontaneous conversational speech in African-accented English — the most relevant benchmark for the kind of accented English speech found in South African research interviews and focus groups where participants code-switch or speak English as a second language.

💡 Practical Guidance for South African Research Contexts

Standard Whisper large-v3 is a reasonable starting point for Afrikaans in quiet, clean recording conditions. For Zulu and Xhosa — especially in the naturalistic, spontaneous speech typical of qualitative interviews — fine-tuned variants or specialist tools (Intron Sahara v2, Lelapa AI) are strongly preferred over standard Whisper. For research that will be published or cited, budget for human verification of all transcripts regardless of the tool used. The performance gap between African languages and major European languages in standard tools is not a fixed property — it is a data problem, and it is improving as more labelled audio data becomes available. Check current benchmark leaderboards before beginning a new research project.

🔍 AI-Assisted Qualitative Analysis

Several qualitative data analysis (QDA) tools have added AI capabilities between 2023 and 2025. These integrations range from AI-suggested codes to conversational querying of your document corpus. Understanding what these features do and do not do is essential for using them responsibly.

💻 ATLAS.ti (Lumivero)

The most AI-integrated qualitative data analysis tool currently available. Includes AI transcription, AI-suggested codes, AI-generated summaries, and conversational querying across your document corpus.

  • AI coding suggestions are a starting point, not a final answer
  • They reflect statistical patterns in your documents, not your theoretical framework
  • Best used to generate initial codes that you then review, accept, modify, or reject
  • Conversational querying is powerful for finding relevant passages quickly

📊 NVivo 15.3 (Lumivero)

AI-generated summaries, AI auto-coding, and integration with statistical analysis. The Lumivero acquisition of ATLAS.ti (September 2024) has brought both tools under the same corporate umbrella, with increasing feature convergence across the two products.

  • AI auto-coding integrates with NVivo’s node structure
  • AI summaries work at document and node level
  • Statistical integration supports mixed-methods research designs

📄 MAXQDA Analytics Pro

AI coding suggestions, document summarisation, and paraphrasing. Lower AI integration than ATLAS.ti but more affordable for individual researchers or small teams. Well-suited to projects where you want AI support without full dependency on AI-generated outputs.

  • AI paraphrasing useful for memo writing
  • Coding suggestions require explicit human confirmation
  • Good export options for integration with other tools

💬 Insight7

Specialised platform for analysing interview transcripts and focus groups at scale. Extracts themes, sentiment, and key insights from audio, video, or text. Designed primarily for UX research and market research, but applicable to academic qualitative work where you need to process many interviews quickly.

  • Handles audio, video, and text input in a single workflow
  • Automated theme extraction across large interview sets
  • Less suited to theoretically-driven interpretive analysis

⚠️ What AI Auto-Coding Does and Does Not Do

AI can identify passages related to themes you specify, flag sentiment, and suggest codes based on semantic similarity to your existing codes or to prompts you provide. It cannot apply your theoretical framework, distinguish between a participant citing a concept and genuinely expressing it, or make the interpretive judgments that are the core of qualitative research. AI coding finds passages efficiently; your own judgment determines what those passages mean within your analytical framework. These are not interchangeable activities.
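
To make the distinction concrete, the sketch below shows the kind of semantic-similarity ranking that underlies AI code suggestions, using the open-source sentence-transformers library; the code description and passages are invented examples. It ranks passages by topical closeness to a code description. Deciding whether a passage genuinely expresses the concept remains the researcher's interpretive judgment.

```python
# pip install sentence-transformers  -- a minimal sketch of similarity-based code suggestion.
# The code description and passages are invented examples for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

code_description = "distrust of local government service delivery"   # your code, phrased as a prompt
passages = [
    "We reported the broken pipe three times and nobody came.",
    "My grandmother still fetches water from the river every morning.",
    "The councillor promised jobs before the election.",
]

code_emb = model.encode(code_description, convert_to_tensor=True)
passage_embs = model.encode(passages, convert_to_tensor=True)
scores = util.cos_sim(code_emb, passage_embs)[0]

# High similarity means "topically close", not "expresses the concept":
# the interpretive judgment about meaning stays with the researcher.
for passage, score in sorted(zip(passages, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.2f}  {passage}")
```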

📋 A Conversational Workflow for Interview Analysis

The following workflow is informed by Friese's (2025) Conversational Analysis with AI (CAAI) framework, which replaces traditional line-by-line qualitative coding with structured dialogic interaction between the researcher and a large language model (Friese, S., 2025, “From Coding to Conversation: A New Methodological Framework for AI-Assisted Qualitative Analysis,” SSRN 5232579). It is designed for qualitative researchers who want to use AI to accelerate analysis without sacrificing interpretive rigour.

  1. Familiarisation via AI-generated summaries. Use AI to produce a structured summary of each interview before your first read. This orients you to the content — the key topics discussed, the approximate emotional register, the main positions taken — without replacing close reading. Treat it as a map, not a substitute for the territory.
  2. Develop your question set. Before querying AI, write out the specific analytical questions your research requires. The AI will answer whatever you ask; the quality of your questions determines the quality of the analysis. Vague questions produce vague answers. Questions grounded in your theoretical framework produce theoretically relevant responses.
  3. Focused AI dialogue. Work with 4–6 interviews at a time rather than your entire corpus at once. This keeps the analysis tractable, allows you to verify AI responses against transcripts you know well, and makes it easier to check consistency across batches. Large-corpus querying is useful for pattern-finding; small-batch dialogue is better for interpretive work (a minimal sketch of this step follows the list).
  4. Synthesis of insights. Compile AI-generated themes across batches, explicitly looking for convergence and contradiction. Where the AI surfaces the same theme repeatedly, investigate whether this reflects genuine salience in your data or a bias in how you formulated your prompts.
  5. Human theoretical integration. Apply your disciplinary framework to the AI-surfaced themes. Situate findings in your conceptual literature. Interrogate the themes against your research questions. This step is not automatable and should not be delegated to AI. The analysis becomes yours through this step.
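
A minimal sketch of step 3, assuming transcripts stored as plain-text files and access to a hosted LLM API (here Anthropic's Python client); the model name, file paths, and analytical question are illustrative assumptions. The same data governance checks described in the privacy section below apply to transcripts sent to a cloud LLM, since transcripts of identifiable participants are personal data.

```python
# pip install anthropic  -- a focused-dialogue sketch over a small batch of transcripts.
# File paths, the model name, and the question are illustrative assumptions.
from pathlib import Path
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

batch_paths = [
    "transcripts/interview_01.txt", "transcripts/interview_02.txt",
    "transcripts/interview_03.txt", "transcripts/interview_04.txt",
]
batch = [Path(p).read_text(encoding="utf-8") for p in batch_paths]

question = (
    "For each interview, identify passages where the participant describes "
    "barriers to accessing primary healthcare. Quote each passage verbatim "
    "and give its interview number so I can verify it against the transcript."
)

prompt = question + "\n\n" + "\n\n".join(
    f"=== Interview {i + 1} ===\n{text}" for i, text in enumerate(batch)
)

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumption: substitute your current model
    max_tokens=2000,
    messages=[{"role": "user", "content": prompt}],
)
print(response.content[0].text)
```

Asking for verbatim quotes with interview numbers makes it straightforward to check each AI-surfaced passage against transcripts you already know well, which is the point of working in small batches.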

🔒 Privacy and Data Governance

⚠️ Audio Recordings of Research Participants Are Sensitive Data

Before uploading audio recordings to any cloud-based transcription service — including Whisper API, AssemblyAI, Deepgram, or any QDA tool with cloud transcription — you must check three things:

🔒 Three-Point Data Governance Check

  • Your institution’s data governance policy: Many universities require that personal data (which includes voice recordings of identifiable individuals) be processed only on approved platforms or servers within specified jurisdictions. The UCT Research Data Management Policy is the relevant document for students and staff at this institution.
  • Your ethics approval: Your ethics certificate specifies how data may be collected, stored, and processed. If it does not explicitly authorise cloud-based processing, you may need an amendment before uploading recordings.
  • Your consent forms: Participants who consented to an interview did not necessarily consent to their voice being processed by a commercial AI service in a foreign jurisdiction. If your consent form does not cover this use, you cannot assume consent applies.

Local processing tools (Whisper running on your own machine) avoid this concern entirely — audio never leaves your device. WhisperX, Whisper.cpp, and Faster-Whisper all run locally and produce output comparable to the cloud Whisper API. For sensitive research data, local processing is the default recommendation.
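
As a minimal local-processing sketch, the example below uses the open-source faster-whisper package; the filename is an illustrative assumption, and the optional language hint (here Afrikaans) can be omitted to use automatic detection.

```python
# pip install faster-whisper  -- local transcription; audio stays on your machine.
# The filename is an illustrative assumption.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cpu", compute_type="int8")  # or device="cuda"

# Pass language="af" (Afrikaans) to skip auto-detection when the language is known.
segments, info = model.transcribe("focus_group_03.wav")

print(f"Detected language: {info.language} (probability {info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:7.1f} -> {seg.end:7.1f}] {seg.text.strip()}")
```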

📚 Readings

🎤 Core Reading 1

arXiv:2510.01145 (2025). “Automatic Speech Recognition for African Low-Resource Languages: A Systematic Literature Review.”
https://arxiv.org/abs/2510.01145

Comprehensive systematic review covering the current state of ASR for African languages, data availability, benchmark performance, and research gaps.

🎤 Core Reading 2

Nahabwe et al. (2025). “Benchmarking Automatic Speech Recognition Models for African Languages.” Deep Learning Indaba 2025.
https://arxiv.org/abs/2512.10968

Systematic comparison of Whisper, W2v-BERT, XLS-R, and MMS across African languages with varying amounts of fine-tuning data. The source for the WER tables in this sub-lesson.

🌐 Supplementary: Intron Sahara v2

Explore the language coverage and accuracy claims for this South African-developed ASR system. Sahara v2 (2026) covers 57 languages total, with 500+ accent variants.
https://intron.ai

🌐 Supplementary: Lelapa AI

South African language AI specialist. Review their Zulu, Xhosa, Sotho, and Afrikaans capabilities and the approach they take to developing models with African language expertise.
https://lelapa.ai

✅ Sub-Lesson 4 Summary

Accuracy landscape: Whisper large-v3 achieves approximately 2.0% WER on clean read speech but 11.5%+ on meeting audio. Real research audio is rarely clean. Always treat headline accuracy figures with caution.

Transcription hallucination: Documented across multiple studies — informal testing found hallucinations in approximately 80% of public meeting transcriptions; Koenecke et al. (FAccT 2024) found that 38% of the hallucinations they identified involved explicit harms. Invented sentences, fabricated proper nouns, and dropped content are all real failure modes. Every transcript used in research requires human verification against the original audio.

African language tools: With only 1 hour of fine-tuning data, benchmarked models show 22–28% WER on Afrikaans, Zulu, and Xhosa; fine-tuning on around 50 hours drops this dramatically. Intron Sahara v2 and Lelapa AI are specialist tools developed for South African language contexts and should be the first consideration for research involving these languages.

Qualitative analysis software: ATLAS.ti, NVivo, MAXQDA, and Insight7 all now offer AI coding and summarisation. AI finds passages; you determine what they mean.

Privacy: Audio recordings of participants require explicit ethics approval and consent cover before cloud processing. Use local tools (Whisper, WhisperX) for sensitive data.

Up next — Sub-Lesson 5: Video analysis and multimodal AI — combining audio, visual, and text modalities in research workflows.